Clustering In cluster analysis (group analysis), a distinction is made between supervised
clustering (groups known) and unsupervised clustering (groups unknown). An example of
supervised clustering is the k-nearest-neighbour algorithm, in which new data points (e.g.
patients) are assigned to predefined groups: the k nearest neighbours are considered (e.g.
for k = 3, the three nearest neighbours), and the new point is assigned to the cluster that
dominates among them. This makes it possible, for example, to assign a diseased person to
an optimal therapy (e.g. radiation, chemotherapy) according to their gene expression
profile. If, on the other hand, one wants to find previously unknown clusters in one’s data,
one can apply unsupervised clustering, such as k-means (non-hierarchical; the underlying
optimization problem is NP-hard) or complete linkage (hierarchical).
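As a minimal sketch of the k = 3 rule described above (with invented toy data standing in for gene expression profiles, not data from the source), the majority vote among the nearest neighbours could look like this in plain Python:

```python
import math
from collections import Counter

def knn_classify(new_point, data, labels, k=3):
    """Assign new_point to the majority class among its k nearest neighbours."""
    dists = sorted((math.dist(new_point, p), lab) for p, lab in zip(data, labels))
    top_k = [lab for _, lab in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Hypothetical patients described by two expression values, in two therapy groups
data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (3.0, 3.0), (3.1, 2.9), (2.9, 3.2)]
labels = ["chemo", "chemo", "chemo", "radiation", "radiation", "radiation"]

print(knn_classify((1.1, 1.0), data, labels, k=3))  # → chemo
```

A new patient near the first group is voted into "chemo" by all three of its nearest neighbours; ties can occur for even k, which is why odd values such as k = 3 are popular.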
Regression Regression analyses examine the relationship between a dependent variable
(regressand, criterion, “response variable”) and an independent variable (predictor,
“predictor variable”). For example, a linear regression can be used to examine the
relationship between weight (independent variable) and blood pressure (dependent
variable). The prerequisite is that the dependent and independent variables are metric. The
calculation is done with the least-squares estimator, which minimizes the sum of the
squared residuals (the distances from the data points to the regression line) to obtain the
best fit to the data (a regression line is fitted through the data). How well the regression
model represents the data (goodness of fit) is usually assessed with a t-test (p-value < 0.05)
and the R2 (coefficient of determination, between 0 [no linear correlation] and 1 [perfect
linear correlation]).
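The least-squares fit and the R2 can be computed directly from the textbook formulas. A small sketch with invented weight/blood-pressure values (not data from the source):

```python
def linear_fit(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot

# Hypothetical weight (kg) vs. systolic blood pressure (mmHg)
weight = [60, 70, 80, 90, 100]
bp = [115, 121, 128, 134, 142]
a, b, r2 = linear_fit(weight, bp)
print(f"bp ≈ {a:.1f} + {b:.2f} * weight, R2 = {r2:.3f}")
```

Here R2 comes out close to 1, i.e. almost all of the variance in blood pressure is explained by the fitted line.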
If, on the other hand, the dependent variable is binary/dichotomous (yes/no), logistic
regression can be used. A popular analysis question here is: What is the probability of
developing high blood pressure (or heart failure) if you are overweight? The calculation is
done using the logit function [log(p/(1 − p))] and the maximum-likelihood method (a
sigmoidal curve is fitted to the data). To assess the model quality, one uses, for example, a
chi-square test (p-value < 0.05), the R2 and the AIC (Akaike Information Criterion, which
penalizes the model fit for the number of parameters used).
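The logit maps a probability to log-odds, and its inverse (the sigmoid) maps the linear predictor back to a probability. A minimal sketch, using hypothetical coefficients (a = −10, b = 0.35 per BMI unit are invented for illustration, not estimates from the source):

```python
import math

def logit(p):
    """Log-odds: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit; maps any real z back to a probability."""
    return 1 / (1 + math.exp(-z))

# Suppose a fitted model: logit(P(hypertension)) = -10 + 0.35 * BMI
a, b = -10.0, 0.35
for bmi in (22, 28, 34):
    p = sigmoid(a + b * bmi)
    print(f"BMI {bmi}: P(hypertension) = {p:.2f}")  # 0.09, 0.45, 0.87
```

This is the “sigmoidal curve in the data”: the predicted probability rises smoothly from near 0 to near 1 as BMI increases.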
Often one also has time points (time-to-event data) for a dependent variable, for example
in the context of a follow-up study. Cox regression is used for this (survival time analysis,
“time-to-event” analysis). Survival time analyses are of interest if one wants to know, for
example, what the influence of a mutation or a therapy is on the 5-year survival. The
calculation uses the Kaplan-Meier estimator (the hazard function describes the risk [failure
rate] that the event actually occurs at a given time; censored data [no exact information
about the event time] contribute to the calculation only up to the time of censoring). The
survival rates are represented in a Kaplan-Meier curve, and the model quality is assessed
using a log-rank test and the Cox proportional-hazards model.
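The Kaplan-Meier estimate multiplies, at each event time, the fraction of at-risk patients who survive that time. A small sketch with invented follow-up data (times in months, event = 1, censored = 0; not data from the source):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time per patient
    events: 1 = event occurred, 0 = censored
    Censored patients leave the risk set at their censoring time
    but are never counted as events.
    """
    at_risk = len(times)
    s = 1.0
    curve = []
    for t in sorted(set(times)):
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        if d:
            s *= 1 - d / at_risk      # survive this event time
            curve.append((t, s))
        at_risk -= sum(1 for ti in times if ti == t)
    return curve

times  = [6, 12, 12, 18, 24, 30, 36]
events = [1,  1,  0,  1,  0,  1,  0]
print(kaplan_meier(times, events))
```

Plotting these (time, survival) pairs as a step function gives the familiar Kaplan-Meier curve; two such curves (e.g. therapy vs. control) are then compared with the log-rank test.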
A nice overview of regression analysis is provided by Worster et al. (2007), Schneider
et al. (2010), Singh and Mukhopadhyay (2011) and Zwiener et al. (2011). The two recent
papers on remdesivir (Wang et al. 2020) and lopinavir-ritonavir (Cao et al. 2020) treatment
in COVID-19 should also be mentioned here.
Logistic regression and Cox regression are popular for the analysis of diagnostic and
prognostic signatures, i.e. the optimal combination of genes (Vey et al. 2019) or